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Abstract 

■ This paper presents a sparse Bayesian inference approach that appHes to sparse signal representation 

" from overcomplete dictionaries in complex as well as real signal models. The approach is based on the 

. two-layer hierarchical Bayesian prior representation of the Bessel K probability density function for the 

> 



OO 

o 



variable of interest. It allows for the Bayesian modeUng of the £i-norm constraint for complex and real 



(N- signals. In addition, the two-layer model leads to novel priors for the variable of interest that encourage 

I more sparse representations than traditional prior models published in the literature do. An extension of 



the two-layer model to a three-layer model is also presented. Finally, we apply the fast Bayesian inference 
scheme by M. Tipping to the two- and three-layer hierarchical prior models to design iterative sparse 
estimators. We exploit the fact that the popular Fast Relevance Vector Machine (RVM) and Fast Laplace 
\ algorithms rely on the same inference scheme, yet on different hierarchical prior models, to compare 

^ . the impact of the utilized prior model on the estimation performance. The numerical results show that 

the presented hierarchical prior models for sparse estimation effectively lead to sparse estimators with 
improved performance over Fast RVM and Fast Laplace in terms of convergence speed, sparseness and 
achieved mean-squared estimation error. In particular, our estimators show superior performance in low 
and moderate signal-to-noise ratio regimes, where state-of-the-art estimators fail to produce sparse signal 
representations. 
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I. Introduction 

Compressive sensing and sparse signal representation have attracted the interest of an increasing number 
of researchers over the recent years (see e.g., ||Il-||4l). This has been motivated by the widespread 
appUcabiUty that such techniques find in a large variety of engineering areas. Generally speaking, these 
disciplines consider the following signal model: 

y = Ha + w. (1) 

In this expression, y is a M x 1 vector of measurement samples, H = [hi, . . . , is an M x L dictionary 
matrix with L > M basis vectors hi, I = 1, . . . , L. The additive term w is an M x 1 perturbation vector, 
which is assumed to be a white Gaussian random vector with mean equal to zero and covariance matrix 
A~^/, where A > denotes the noise precision parameter and I is the identity matrix. The sparsity 
aspect is reflected by the fact that the entries in the L x 1 unknown weight vector a = [oi, . . . ,0^]^ 
are mostly zero. By obtaining a sparse estimate of a we can accurately represent Ha with a minimal 
number of basis vectors. 

We coin the signal model ([T|) as either real, when H, a, and w are all real, or as complex, when H, 
a, and w are all complex 3 Historically, real signal models have dominated the research in sparse signal 
representation and compressive sampling. However, applications in which sparse estimation for complex 
signal models is sought are not uncommon in practice. An example is the estimation of multipath wireless 
channels The extension from sparse signal representation in real signal models to complex models 

is not always straightforward, as we will discuss in this paper. 

Sparse Bayesian learning (SBL) jSj, Q, ||8l applied to signal model ([T|) aims at finding a sparse 
maximum a posteriori (MAP) estimate of a 

«MAp(y) = argmin {p\\y - Ha\\l + X'^Q{a)] . (2) 

In this expression, || • ||p denotes the ip norm and the parameter p takes values p = 1/2 when the signal 
model © is real and p = I when it is complex. The estimate Qmap minimizes a weighted sum of 
two terms: the first term represents the squared-error of the reconstructed signal, while the second term 
Q{a) oif —\ogp{a) is a penalization term designed to enforce a sparse estimate of the weight vector 

'obviously, one could also consider a mixed model where, e.g., H and w are complex but a. is real. Since, applying our 
inference approach to this mixed model is straightforward, we discard it in this paper and focus on the two most relevant cases 
of real and complex signal models. 
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ctl^ In consequence, a variety of different estimators can be obtained by modifying the prior probability 
density function (pdf) p{a). Each of these estimators represents different compromises between accuracy 
of the reconstructed signal and sparseness of the estimate. Instead of working directly with the prior 
pdf p{a), SBL typically uses a two-layer (2-L) hierarchal prior model that involves a conditional prior 
pdf p{a\f) and a hyperprior pdf ^(7). The goal is then to select these pdfs in such a way that we can 
construct efficient yet computationally tractable iterative algorithms that estimate both the hyperparameter 
vector 7 and the weight vector a with the latter estimate being sparse. 

The 2-L prior model can be well understood from the perspective of a Gaussian scale mixture (GSM) 
model |[9l- lfT2l . where the entries in a are modeled as independent GSMs lfT3l . Specifically, ai is modeled 
as ai = where ui is a standard Gaussian random variable and 7; is a nonnegative random scaling 

factor, also known as the mixing variable and described by its mixing density ^(7/)!^ By choosing 
appropriate mixing densities various sparsity-inducing penalty terms in Q can be realized. 

One approach illustrating the use of the GSM model for SBL is the Relevance Vector Machine (RVM) 
Q, where the mixing densities are identical and equal to an inverse gamma pdfS This choice leads to the 
marginal prior pdf p{a) being a product of L student-t pdfs. An expectation-maximization (EM) algorithm 
is then formulated to estimate the hyperparameter vector 7 = [71, ... , 7l]^; as the estimate 7/ decreases 
it drives the corresponding weight estimate ai towards zero, thus encouraging a solution with only a 
few nonzero coefficients in a. Another approach is presented in lfT4l that realizes two popular sparsity- 
inducing penalization terms: the £i-norm and the log-sum. This is achieved by selecting identical mixing 
densities equal to respectively an exponential pdf and the density of the noninformative Jeffreys prior. In 
the former case, the marginal prior pdf p{a) is the product of Laplace pdfs and the corresponding MAP 
estimator Q solves the £i-norm constrained minimization problem. The same problem is also solved 
by the well-known Least Absolute Shrinkage and Selection Operator (LASSO) 1, 15 J and Basis Pursuit 
Denoising 11161 . Similarly to the RVM, an EM algorithm is used in lfT4l to estimate the hyperparameter 
vector 7. The resulting iterative algorithms are equivalent to special cases of the FOcal Underdetermined 
System Solver (FOCUSS) algorithm ifTTI . 

A significant Umitation of the sparse estimators in Q, lfT4l is that the EM algorithms they use are know 

^Here x cc" y denotes exp{x) = exp(i;) exp(y), and thus x = v + y, for some arbitrary constant v. We will also make use 
of a; oc y, which denotes x = vy for some positive constant v. 

^In this configuration, 7; can be seen as the variance for ai. 

''in the original formulation of the RVM algorithm in |7| the parameter 7; models the precision (inverse variance) of the 
conditional Gaussian prior pdf p(a;|7i) and the hyperprior pdf p(7;) is a gamma pdf. The model has been reformulated here 
in order to facilitate the comparison with the framework adopted in this paper. 
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to suffer from high computational complexity and slow convergence llTSl . Specifically, many iterations 
of the algorithm are needed before the entries of 7 become small enough to disregard the corresponding 
basis vectors. To circumvent this shortcoming, a fast inference scheme is proposed in i fTSl for RVM and 
later applied in |[T9l . The Fast Laplace algorithm |[T9ll is derived based on an augmented probabilistic 
model obtained by adding a third layer to the hierarchical structure of the Laplace pdf; this third layer 
introduces a hyper-hyperprior for the rate parameter of the exponential pdf. The algorithm in |[T9l can be 
seen as the Bayesian version of the £1 re-weighted LASSO algorithm |[20l . A similar three-layer (3-L) 
prior model leading to an adaptive version of the LASSO estimator is also proposed in |[211 . 

Even though the fast algorithms in ifTSl and |[T9l converge faster than their EM counterparts, they still 
suffer from slow convergence, especially in low and moderate signal-to-noise ratio (SNR) regimes as we 
will demonstrate in this paper. Furthermore, in these regimes the algorithms significantly overestimate 
the number of nonzero weights and therefore they do not accomplish sparse representations. We will 
show that this behavior results from the selection of the hierarchical prior models. 

Additionally, though complex GSM models have been proposed |[22l . |[23l . they have not been exten- 
sively studied for SBL in the literature. An example illustrating this fact is the hierarchical modeling of 
the ii penalty term. While it is well known that this penalty term results from selecting the exponential 
mixing density in real models, the same density will not induce this penalty term in complex models. 
Yet to the best of our knowledge, the complex GSM model realizing the ii penalty term has not been 
established in the Uterature. Motivated by the relative scarcity of formal tools for sparse learning in 
complex models and inspired by the recent developments of sparse Bayesian algorithms, we present a 
sparse Bayesian approach that applies to both real and complex signal models. 

We first present a 2-L hierarchical prior model inspired by the GSM model for both real and complex 
weights. By selecting each mixing density ^(7/), / = 1, . . . , L, to be a gamma pdf, a Bessel K pdf is 
obtained for the marginal prior of each of the entries in a j[T0l - |[T2l . The Bessel K pdf does not only 
encompass the modeling of the Laplace pdf as a special caseO but also introduces new degrees of freedom 
that allow for novel priors to be used in SBL. In the particular case where the dictionary matrix H is 
orthonormal, we demonstrate that the MAP estimator is a generalization of the soft-thresholding rule 
with degree of sparseness depending on the selection of the shape parameter of the gamma pdf ^(7/). 
Additionally, we show that the prior model has a strong connection to the Bayesian formalism of the 
group LASSO 11211 . |[24]| . A 3-L hierarchical model is also investigated that results from an extension 

^The Bessel K pdf is in turn a special case of even a larger class of generalized hyperbolic distributions [10], obtained when 
the mixing density is a Generalized Inverse Gaussian pdf. 
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of the 2-L model. For H orthonormal the MAP estimator is, in this case, a generaUzation of the hard- 
thresholding rule, with the shape parameter of ^(7^) determining again the sparseness property of the 
estimator. Finally, we apply the fast inference scheme in |18| to the 2-L and 3-L hierarchical structures 
to design sparse estimators for a. As these structures encompass those used in [18] and [19], the iterative 
algorithms derived in these publications can be seen as instances of our estimation algorithms. We provide 
numerical results that reveal the superior performance of our estimators with respect to convergence speed, 
sparseness, and mean-squared error (MSE) of the estimators. Most remarkably, the 2-L prior model leads 
to an estimator that outperforms the state-of-the-art Bayesian and non-Bayesian sparse estimators. 

The remainder of this paper is organized as follows. In Section|lIl we detail the 2-L and 3-L hierarchical 
prior models for real and complex weights and analyze their sparsity-inducing properties. Two fast 
Bayesian iterative algorithms based on the 2-L and 3-L models are proposed in Section |IIIJ where 
we also outline their relationship to the algorithms in ifTSl and |fT9l . A performance comparison of our 
two algorithms with related state-of-the-art sparse estimators obtained via Monte Carlo simulations is 
presented in Section UV] Finally, we conclude the paper in Section |Vl 

IL Bayesian Hierarchical Modeling of Sparsity Constraints 

We begin with the specification of the probabilistic structure of the SBL problem for signal model 
([T|l. Two types of hierarchical prior models for the weight vector a are considered: a 2-L and a 3-L 
hierarchical model. Later we will see that these models lead to priors for a with distinct sparsity-inducing 
properties. 

The joint pdf of the signal model O augmented with the 2-L prior model for a. reads 

p{y,a,f,X) = p{y\cx, X)p{X)p{cx\f)p{'y). (3) 



From ([T|), p{y\a,X) is Gaussian: N{y\Ha, X^^I) if the signal model is real and CN{y\Ha, X^^I) if 
it is complexjj The sparsity-inducing properties of this Bayesian model are determined by the joint prior 
pdf p{a\f)p{'y)- Motivated by GSM modeling as well as previous works on SBL Q, |[T4l . |[T9l we select 
the conditional prior pdf p{a\^) to factorize in a product of Gaussian pdfs: p^al-y) = Y[f=iPi^ihi)- 
For the sake of using a single notation for p(a/|7/) for both real and complex weights, we introduce the 

*N(-ja, B) and CN(-|a, B) denote respectively a multivariate real and a multivariate complex Gaussian pdf with mean vector 
a and covariance matrix B. 
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parameterization 




The conditional prior pdf p{ai\ji) for real ai is realized by selecting p = 1/2, while p = 1 entails the 
conditional prior pdf for complex a/. 

The hyperprior pdf ^(7) = ^(7; 77) of the 2-L prior model is selected to be some suitable pdf defined 
over with fixed parameter 77. An extension of a 2-L hierarchy with an additional layer, where the 
entries in r/ are also modeled as random variables, has been considered in |fT9l . EJ\ . Doing so yields 
the corresponding 3-L prior model with the joint pdf specified as follows: 

p{y, a, 7, T], A) = p{y\a, X)p{X)p{a\j)p{'y\'n)p{r]) , (5) 

where p{r]) of rj is the pdf of the "hyper-hyperprior". 

In the sequel we analyze the 2-L and 3-L prior models in more detail by computing the prior pdfs of 
the weight vector a obtained by marginalizing the joint prior pdfs p{a\f)p{'j) and p{cy.\^)p{'y\r])p{r]) 
with respect to the hyperparameters 7 and {7, r}} respectively. We then discuss the sparsity-inducing 
properties of these marginalized priors in the case when the weights are real and complex. This includes 
the modeling of the ii penalty term as a special case. 



A. Two-Layer Hierarchical Prior Model 

We begin with the hierarchical model of the Bessel K pdf. Specifically, ^(7) is selected as a product 
of gamma pdfs with individual rate parameters, i.e., ^(7) = Y[h=iPi'yi) '^^^^ P{li) = vilV^^i'li) — 
Ga(7;|e, r7/)0 Observe that while prior pdfs ^(7/), / = 1, . . . , L, are assigned individual rate parameters 
r]i, they have the same shape parameter e. Let us define r} = [r/i, . . . , tjl]^ . The pdf of the marginal prior 
of a is computed to be 

L 

,Vi) (6) 



/■oo 

p{a;e,r])= / p{a\f)p{'y;€,r])d-f = Y[piai;e, 

1=1 



with 



p{ai;e,r]i) = --—-r]^ \ai\' PK,_p{2y/^i\ai\). (7) 



^Ga(-|a, b) = r^"''" ^ exp{—bx) denotes a gamma pdf with shape parameter a and rate parameter 6. 
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In this expression, K,y{-) is the modified Bessel function of the second kind and order v £ M,. Note that 
^ is vaUd for both real (p = 1/2) and complex {p = 1) weight vector a. Let us study (t?) for these two 
cases in more detail. 

1) Real weights: With the selection p = 1/2, (Q coincides with the Bessel K pdf US, Ell- This pdf 
has been used for, e.g., modeling image probabilities ||25l . ||26l . Moreover, selecting e = 1 and making 
use of the identity Ki{z) = y^exp(— z) |27], (O simplifies to the Laplace pdf 



/ Vi I — 

p(az;e = l,r/;) = W — exp(-V2??H«d)' «/ G (8) 

The Laplace distribution for real weights is thereby obtained from a GSM model with an exponential 
mixing density |9|. 

2) Complex weights: We now consider ([7]) for the case when the weight ai is complex, i.e. p = 1. In 
contrast to the previous case, selecting e = 1 no longer leads to the Laplace pdf. Instead, a gamma pdf 
with the shape parameter e = 3/2 has to be used as the mixing density. With this choice, the modified 
Bessel function in ([7]) has order 1/2 and 

p(a;;e = 3/2,770 = ^exp(-2V^|a, I), a, G C, (9) 

TT 

which is the Laplace pdf in the complex domain. Note that the choice e = 3/2 has some connection with 
the group LASSO and its Bayesian interpretation 1211 . |[24l . where sparsity is enforced simultaneously 
over a group of variables. In the Bayesian interpretation of the group LASSO a gamma pdf with 
a shape parameter {mk + l)/2 is employed to model the prior for the group weights; naturally, if we 
let a group consist of the real and imaginary part of a complex variable, then ruk = 2 and hence, 
e = {mk + l)/2 = 3/2. 

To conclude, the 2-L prior model realizes the £i penalty term Q{oc;r]) = 2y/prj\\a.\\i with e = 1 for 
real a and with e = 3/2 for complex a when selecting r]i = t], I = 1, . . . , L. With these settings, the 
argument of the minimization in ([2]) particularizes to the LASSO cost function. 

Let us stress that (|7]) represents a family of prior pdfs for a parameterized by e and rj. In Fig. [T] 
we visualize the restrictioij^ to M of the prior pdf p{ai,e,rii) in d?]) with /> = 1 for various values 
of e. Observe the change of the shape of p{ai;e,r]i) with e: the smaller the value of e is the more 
rapidly p{ai,e, rji) decays around the origin. Moreover, with the selection e = 0, p{'yi) oc exp(— 7/^7^) 

^Let / denote a function defined on a set A. The restriction of / to a subset B C ^ is the function defined on B that 
coincides with / on this subset. 



April 4, 2012 



DRAFT 



8 




becomes improper, i.e., its integral is no longer finite. This in turn leads to the improper prior density 
p{ai) oc \ai\^PKp (^2^pr]i\ai\) which has a light tail and a pole at the originjfl This prior realizes a strong 
sparsity-inducing nature as we demonstrate next. 

Naturally, the 2-L prior model can be used with arbitrary values of e, leading to the general optimization 
problem ^ with penalty term 

L 

Q{a; e,r,) = Y,^og {lail'-PK^^p i2^i\ai\)) . (10) 
1=1 

To illustrate this consider Fig. |2(a)[ where the contour lines of the restriction to of Q{ai, 02; e, rj) c<f 
— log(p(ai; e,r/)p(a2; e,^/)) are plotted. Each contour line is computed for a specific choice of e. It can 
be seen that as e approaches more probability mass concentrates along the axes; as a consequence, 
the mode of the resulting posterior pdf p{oL\y;e,'q) is more likely to be found close to the axes, thus 
indicating a sparse solution. The behavior of the classical li penalty term obtained for e = 3/2 can also 
be clearly recognized. 

In order to get insight into the impact of e on the MAP estimator Q with penalty term ([TOl l. we 
follow the approach in |[T4l and consider the case when H is orthonormal, i.e., when hf^hk = 5k^i, 
k,l = 1, . . . ,L, where d^^i is the Kronecker delta. In this case the optimization (|2]| decouples into L 
independent scalar optimization problems. Furthermore, when the 2-L prior model realizes an £1 penalty 

'when a G R this prior simplifies to p{ai) oc jaij^^ exp{—^/2rfi\ai\). 
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(a) (b) 



Fig. 2. Two-layer hierarchical prior for the complex signal model: (a) Contour plot of the restriction to the Im{ai} — Im{a2} ~ 
- plane of Q{ai, Q2; e, rj) '^'^ ~ log(p(c*ii v)p{'^2', e, v))- (b) Restriction to lm{h^y} = of the MAP estimator with e as 
a parameter in the case when H is orthonormal. In (b) the black dashed line indicates the hard- threshold rule 1281 . 



term, i.e., the pdf dS) (real case) or ^ (complex case) is selected, the MAP estimator coincides with the 
soft-thresholding rule 

a«,MAp(y) = sign(^py)max|o, - A^^y^l , l = l,...,L, (11) 

where sign(x) = x/\x\ is the sign function. For an arbitrary value of e, we cannot compute Qmap in 
closed form. Therefore, we make use the EM algorithm as implemented in |[T4ll to approximate the MAP 
estimator. Figure [2(b)] visualizes the estimation rule for different values of e. Note that as e decreases, more 
components of the estimate are pulled towards zero since the threshold value increases, thus encouraging 
a sparser solution. 

B. Three-Layer Hierarchical Prior Model 

We now turn to the SBL problem with a 3-L prior model for cx which leads to the joint pdf dSjl of the 
augmented signal model. In contrast with the 2-L prior model, we consider the entries in rj as random 
and assume that p{r]) = Wi pim) with p{r]i) = p{r]i; ai,bi) = Ga{r]i\ai,bi), I = 1,...,L. 

We now compute the prior pdf p{a) by marginalizing p{a,^,r]) with respect to 7 and rj. First, we 
note that p{a,-f,r]) = p{a\-f)p{-f\r])p{r]) = flti ?'('^d7/)p(7d^/)p('?«) ^"d marginalize this pdf over 
77. This requires computing p{a\f)p{'y) = p{a\'y) f p{'y\r})p{r])dr]. Defining a = [ai, . . . ,ai\^ and 
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(a) (b) 



Fig. 3. Three-layer hierarchical prior for the complex signal model with the setting a = 1, b — 0.1: (a) Contour plot of the 
restriction to the Im{ai} = Im{a2} = - plane of the penalty term Q{ai, 02; e, a,b) ix^ —\og{p{a\\e,a,b)p{a2;e,a,b)). 
(b) Restriction to \m{h^y} = of the MAP estimator with e as a parameter in the case when H is orthonormal. In (b) the 
black dashed line indicates the hard-threshold rule and the black solid line the soft-threshold rule (TTJ. 



h = [hi, . . . , bif^ we obtain 

POO 

p{-f;€,a,b) =Y[ p{-fi\rji;e)p{rji;ai,bi)drii =Y[p{ir,e,ai,bi) (12) 

7 7 



L 

/jo 

with 



pili;e,a,,bO = %^^p^7rHli + bi)-^'-''''^. (13) 
Finally, marginalizing p{oL\'j)p{'y; e, a, b) over 7 yields 

L 

p{a;e,a,b) = Y{p{ai;e,ai,bi) (14) 

with 

p{ai-e,ai,bi)= p{ai\ji)p{'ji)d'ji 
Jo 

= UJ mnaO ['^) U[e + ar,e-p+l;p^). (15) 

In this expression, [/(•;•;•) is the confluent hypergeometric function jlTl . In Fig. |3(a)[ we depict the 
contour lines of the restriction to M? of Q(qi, 02; e, a, 6) oc^ — log(p(Qi; e, a, 6)^(02; e, a, 6)). Observe 
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that although the contours are qualitatively similar to those shown in Fig. 2(a) for the 2-L model, the 
corresponding estimation rules depicted in Fig. |3(b)| differ from those shown in Fig. |2(b)[ the 3-L prior 
model lead to a generalization of the hard-thresholding rule. 

References [21] and |19| also propose a 3-L prior model to derive a Bayesian adaptive version of the 
LASSO estimator for hierarchical adaptive LASSO and for the Fast Laplace algorithm respectively. There 
is, however, one important difference between these approaches and the one advocated in our work: the 
2-L and 3-L prior models presented here are not constrained to realize the £i penalty term, but can be 
used for arbitrary values of e. It is precisely this distinction that makes these hierarchical prior models 
lead to estimators that outperform classical -penalized sparse estimators, as our numerical evaluation 
in Section |IV] will reveal. 



C. The Jeffreys Noninformative Prior 

Note that a; = 6; = e = in (fT3] ) entails the density of the noninformative Jeffreys prior for 7^ and 
thereby the improper marginal prior density p{cx) oc H^i l^^d^''- The same holds for the 2-L prior model 
with the selection e = r/^ = 0. In other words, when the highest hierarchy layer in the prior formulation is 
chosen to be noninformative, the log-sum penalization Q{ol) = 2pJ2iLi log is invoked. This penalty 
term has gained much interest in the literature, including Q, lH, |[T4l . as it is known to strongly promote 
sparse estimates. We should also add that the £1 re-weighting scheme in |[20i has also been motivated 
from the log-sum penalty term. 

III. Sparse Bayesian Inference 

In this section we present the Bayesian inference scheme that relies on the 2-L and 3-L prior models 
presented in Section [III First, we exploit the EM framework to derive an iterative algorithm that estimates 
the unknown model parameters {a, 7,77, A} of the 3-L prior model. Then, a fast algorithm with low 
computational complexity inspired by [1181 is presented. The algorithm based on the 2-L model can 
easily be obtained by treating as a fixed parameter. 

A. Sparse Bayesian Inference using EM 

Direct computation of the estimate in Q is not always a straightforward task. In particular, for small 
e the penalty term Q{a) (see e.g., ( fTOl l) loses convexity and the calculation of the minimum of the 
objective function in Q becomes impractically complex. The hierarchical structure of the 2-L and 3-L 
prior models instead provides an efficient yet computationally tractable approach to compute an estimate 
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of a. This is achieved at the expense of introducing the additional hyperparameters, 7 and r/, that have 
to be estimated alongside with a. The sparse Bayesian inference scheme previously used for SBL (see 
fTl , |[8l . ifTSl . |fT9l . 1291 ) can also be adopted for our prior models. Its main steps are summarized below 
when it is applied to the 3-L prior model. The reader is referred to the above cited works for more 
information. The inference scheme exploits the particular factorization of the posterior pdf of the model 
parameters: 

p(q;,7,77, A|?/) =p(a|y,7,A)p(7,?7, A|2/). (16) 

The joint maximization of the left hand-side of ([T6l ) is not feasible. Instead, a two step approach is adopted 
consisting of a first step which computes the maximizer of the second factor, followed by a second step 
which calculates the maximizer of the first factor with {7, 77, A} set to the value already obtained in 
the first step. Note that the second step computes the MAP estimate of {7,//, A}, while the first step 
calculates the conditional MAP estimate of a given {7, 77, A} set to its MAP estimate. Unfortunately, 
the first step is computationally prohibitive, so it is replaced in the Bayesian inference scheme by an EM 
algorithm that sequentially computes an approximation of the MAP estimate of {7, 77, A}. The virtue of 
this modified scheme is that the conditional MAP estimate of cx computed in the second step is readily 
a byproduct obtained in the E-step of the EM algorithm. In our particular application context, we update 
the estimates of the components in {7, 77, A} sequentially, instead of jointly. The generalized EM (GEM) 
framework justifies this modification. 

The EM algorithm aims at approximating the MAP of {7, 77, A}, i.e. the maximizer of 

>C(7, v. A) = \ogp{y, 7, T7, A) = log(p(y I7, X)p{l\'n)p{r})p{X)) (17) 

by exploiting the fact that {q, y} is a complete data for {7, r], A}: 

p{yn,vA) = J p{y,a,'y,vA)da- (18) 

The E-step of the EM algorithm computes the conditional expectation 

{logp{y,a,j,ri,X)) (19) 

with respect to the conditional pdf p(Q!|y, 7, A) = N^Q|/i,S^ or p(Q!|y, 7, A) = CN ^q;|/x,S^ whether 
the underlying signal model is real or complex. Here, 7 and A denote the current estimates computed at 
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a previous iterationl^ In either cases, the parameters of the conditional pdf of a read 

-1 



£ = (XH^H + r ^) , (20) 



A = X^H^'y, (21) 

with r = diag(7i, . . . , 7l). The M-step of the EM algorithm updates the estimate of {7, 77, A} by 
equating it with the maximizer of ( fT9b . We can invoke the GEM framework to replace this step by a step 
updating the three components in {7, 77, A} sequentially. Specifically, 7 = argmax^(log(p(Q;|7)p(7|r/))), 
r) = argmax^ log(p(7|?7)p(?7)), ^^id A = argmax;^(log(p(?/|Q:, A)p(A))). Doing so we obtain 

. e-p-l + ^/ie-p-lr + 4pf,l(J^ 

11 = TT- ) / = 1,...,-L, (22) 

Vl= ^ , , ^ l = l,...,L, (23) 

li + h 

^ M 

{\\y-Hcy.\\i) 

In accordance with IH, |[T4ll . we have selected a constant (improper) prior for A, i.e., p(A) oc 1. This 
choice is motivated by the fact that the influence of the shape of the conjugate gamma pdf p{X) becomes 
small for a large M. 

In this formulation of the EM algorithm, we have chosen a as the hidden variable. Another approach 
followed in lfT4ll consists in letting 7 be the hidden variable and including the estimate of ct in the 
M-step. However, our numerical results indicate that the resulting EM algorithm leads to estimators that 
may exhibit an unstable behavior when small values of the parameter e are selected. These estimators 
are consistently converging to a local minimum and often fail to produce sparse estimates of ex. We have 
observed the same convergence behavior when this EM algorithm is applied to the probabilistic model 
using the noninformative Jeffreys prior for ^(7;), / = l,...,L, also proposed in |[T4l . As previously 
mentioned, this estimator coincides with the FOCUS S algorithm with the setting p = ifTTll (see lO for 
a thorough numerical analysis on different choices of p). Finally, another approach to estimate the model 
parameters is to use variational Bayesian inference techniques, see e.g., 0, |[30l . |[3T| . This approach 
has not been followed in this paper. 

'"To keep notation simple we avoid the use of an iteration index in the formulation of the EM algorithm: in the updating 
equations, the estimates on the left-hand side are the new update estimates, while those on the right-hand side are the current 
estimates. 
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B. Fast Sequential Sparse Inference Scheme 

Although the EM algorithm derived in the previous subsection yields simple expressions for the 
parameter updates, it suffers from two major drawbacks: first, the algorithm converges rather slowly, 
i.e., many iterations are needed before the entries of 7 corresponding to the zero-valued weights become 
small enough to remove the corresponding basis vectors, which reduces the computational complexity 
of the update (l20b . Secondly, unless A = 00, i.e., in a noiseless case, the entries of 7 are never exactly 
zero; they take very small, yet nonzero values, and thus an additional, usually empirical, thresholding 
procedure is needed to remove components with small weights in the inference procedure. 

In order to circumvent these drawbacks, we follow the framework proposed by [18|: rather than 
estimating the entries of 7 simultaneously for all components in the dictionary H, the objective function 
([TT]) is maximized sequentially with respect to the parameter 7/ of a single basis vector hi at a time, while 
keeping the parameters of the other basis vectors fixed. Thus, the parameters of all L components are 
updated sequentially, one after another. This approach leads to a much more efficient inference scheme 
that proves to accelerate the convergence compared to EM. It also provides conditions for determining 
when a component has to be removed, thus eliminating the need for a pruning threshold. 

Let us reinspect dTT] ) in more details. First we note that p(y|7,A) = J p{y\(y.,\)p{cx\'^)da in (ITtI ) 
can be computed in closed form. Since, both p{y\cy.,X) and p{(y.\^) are Gaussian pdfs, p{y\^,\) is 
likewise Gaussian with mean zero and covariance matrix C = X^^I + HTH^. Now, let us consider 
the dependency of ( [TT] ) on a single parameter 7;. We note that C = X^^I + Ylk^ti^khkh}^ + Jihihf^. 
Making use of the matrix inversion lemma [|32|| we express the inverse of C as 

-1 -1 C }h]h?C } 
= C-} ' 7' , (25) 

where C-i = X^^I + J^k^ilk^kh^. Following the GEM framework, we consider the terms in ([TT] ) 
depending on 7 while keeping r/ and A fixed to their current estimates: 

£(7) 4 £(7;r),A) = -plog|C_/| - py^CZfy + (e - 1) J] log7fe - J] 77^7^ 

kjtl k^l 

- plog |1 + 7//i; C_i hi\ + p— H^-i. + - 1) log^z - mil 

li + hi C_i hi 

= C{j_i)+e{^i). (26) 
The first summand C{'y_i) contains the terms independent of 7;. The terms depending on 7; are encom- 
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passed in the second summand: 

Kli) - -plog |1 + 7^5/1 + P^r ^ - 1) log7i - rjai (27) 

7z +si 

with the definitions si = h^C~lhi and qi = y^CZihi. Note that the definition domain of £{'yi) is M"*". 
We seek to compute the stationary points of £(7^). To this end, we take its derivative and equate the 
result to zero. Doing so yields the cubic equation 



= -iffjisf --if (1 - e + p)sf + 21)181 + 7z 2(e - l)s/ - sip + p\qi\^ - fji 



+ (e-l). (28) 



When 7z ranges through R, the equation has three solutions which can be determined analytically. While a 
feasible solution for 7/ is constrained to be positive, a further analysis is needed to determine which of the 
roots of the cubic polynomial have to be selected as a solution. We immediately note that lim^,_^oo ^(7/) = 
—00, while lim^,_i.o ^(7^) = (1 — e) • 00. In other words, depending on the sign of (1 — e), ^(7^) diverges 
either to plus or minus infinity when 7; tends to zero. In the following we consider several different 
choices for e. 

1) Stationary points of £{^1) for e < I: When e < 1, £(7^) diverges to plus infinity at the origin and 
hence, has no global maximum. In this case we turn to the following proposition: 

Proposition 1: Provided e < 1 the function ^(7/) has at most a single local maximum. 

Proof: See the Appendix. ■ 
Proposition [U states a very pleasing result as it provides us with an easy routine for selecting 7;. Either 
we select the local maximizer of ^(7;), provided it exists, or we prune the basis vector in H by setting 
71 = 0. 

2) Stationary points of £{ji) for e = 1: In this case ( [281 ) simplifies to 



= -liVisf - 11 



sfp + 2fiisi 



sip + p\qi?' - fii- (29) 



The two roots of the right-hand quadratic polynomial read 



2sirii 

(+) = -{sip + 2fj,) + ^r^ ^^^^ 
2sirii 

where = {sip + 2r]i)'^ — 4:fji{f]i + sip — plqil"^) and A > 0, with the equality holding if, and only if, 
\qi\ = Si = 0. While 7I ^ is always negative, 7^^^^ is positive if, and only if, \qi\'^ — si > p^^fji. Moreover, 
from liuiy^^Q £{'-fi) = and lim^^^j^oo ^(7;) = —00 we conclude that if 7|^-* is positive it corresponds to 
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the global maximum of £(7/). Thus, the solution 7; is obtained as 



2sifji 



if kH^ - si> p ^f]i 



11 



= < 



(32) 







elsewhere. 



The case e = 1 corresponds to the hierarchical modeling of the Laplace prior for real weights: (l32l ) 
coincides with the estimate of 7; obtained in [19] for the Fast Laplace algorithm when fji = . . . = fji = fj 
and p = 1/2. Obviously, (l32l ) can also be used for the complex case with p = I, yet in this case the 
equivalent prior for a is no longer Laplace, as we showed in Section [III but some other prior from the 
Bessel K family. 

The case e = 1 is also connected to the Fast RVM algorithm IfTSl . The selection e = 1 and r/ = 
[0, . . . ,0]^ corresponds to a constant hyperprior pdf pdlrj), so that the hyperparameter vector 7 is 
estimated via maximization of the type-II likelihood function p(y|7, A). Taking the derivative of (1271) and 
setting the result to zero leads to the update expression 



used in |[T8l . It is interesting to note that ( [33] ) is independent of p and thus is the same regardless whether 
the signal model ([T]) is real or complex. The same holds for the RVM algorithm. 

3) Stationary points of £{'~fi) for 1 < e < 1 + p.- Since £{'yi) diverges to minus infinity as 7; — and 
7; — )• 00, and is continuous, it must have a global maximum. In this case, we make use of Proposition |2] 

Proposition 2: Provided 1 < e < 1 + p the function £{ji) has a single stationary point and this 
stationary point corresponds to a maximum. 

Proof: See the Appendix. ■ 
Hence, the positive solution to (l28l) is the global maximizer of ^(7/). 

4) Stationary points of for e > 1 + p: This case is of a limited practical interest since the 
resulting penalty term is no longer sparsity-inducing. We thus leave out the corresponding analysis of 
the stationary points. 

C. Realization of the Fast Sequential Inference Scheme 

The fast inference scheme presented in Section IIII-BI can be realized using different approaches. For 
instance, one can start out with the "full" dictionary H and at each iteration of the algorithm remove a 




(33) 
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basis vector hi with a corresponding 7^ = 0. This effectively reduces the model complexity "on the fly". 
However, the first iterations still suffer from a high computational complexity due to the update (l20l ). 

To avoid this, we follow the approach in 1.18] , which consists in starting with an "empty" dictionary 
H and incrementally filling the dictionary by adding one basis vector at each iteration of the algorithm. 
More specifically, we start with an empty dictionary and "test" each basis vector by computing each 
7;, / = 1, . . . ,L, and adding the one vector hi in H with the corresponding 7^ that gives rise to the 
greatest increase in £(7/) between two consecutive iterations. The estimator derived based on the 3-L 
model updates fji according to ( |23] ). We then follow the add, delete and re-estimate procedures for S, 
fi, si, and qi given by |18 | (s/ and qi can be computed in a very efficient manner for all basis vectors). 
In the first update iterations, the estimate of A is likely to be inaccurate, as H contains only a few basis 
vectors. Therefore, the estimation of A is enabled after several update iterations have been performed 
only. In this case, A must be initialized to a fixed value in the first iterations. 

IV. Numerical Results 

In this section we analyze the performance of the proposed estimation algorithms in Section |III] 
The performance is evaluated by means of Monte Carlo simulations. Both real and complex signal 
models are considered in the investigations. We use a random M x L dictionary matrix H whose 
entries are independent and identically distributed (iid) zero-mean real (complex symmetric) Gaussian 
random variables with unit variance. A total of K components in a are nonzero, and the indices of 
these are uniformly drawn without repetition from the set {1, 2, . . . , L}. These indices together with K 
are unknown to the algorithms. The nonzero components in a are iid and drawn from a zero-mean real 
(complex circular) Gaussian distribution with unit variance. All reported curves are computed based on a 
total of 100 Monte Carlo runs. For each such run new realizations of the dictionary matrix H, the vector 
a, and the random perturbation vector w are generated. To further simplify the analysis of the simulation 
results we assume that the noise variance A^^ is known. The performance of the algorithms is evaluated 
based on the MSB of the estimator and the mean of the number K of nonzero components in a. These 
two performance metrics combined provide a strong indication of whether a given algorithm has properly 
inferred the support set of a, defined as the entries of the nonzero components in this vector. 

We refer to the proposed estimation algorithms using the 2-L and 3-L prior models as Fast-2L and 
Fast-3L respectively. In all simulations we select the free parameters of the prior models as summarized 
in Table |I] According to Section [III e affects the sparsity-inducing property of the hierarchical prior model 
and, correspondingly, the penalty term in Q. The value of the parameter e is appended to the acronyms 
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TABLE I 

The selected parameters for the proposed prior models presented in SectionHII 



Legend 


Model 


Parameters 


Fast-2L(e 


= 1/2) 


2-L 


e = 1/2, r/j = 1 VZ 


Fast-2L(e 


= 0) 


2-L 


e = 0, ?7; = 1 V/ 


Fast-3L(e 


= 1) 


3-L 


e = 0, ai = l ,bi = 0.1 VZ 



Fast-2L and Fast-3L. Note that the values of the rate parameters rji, I = 1, . . . , L, need to be specified for 
Fast-2L. Our investigations show that for e < 1 the performance of Fast-2L becomes largely independent 
of the choice of r/ in moderate to high SNR regimes; nonetheless, in low SNR regime a large value of 
r]i, I = 1, . . . , L results in overly sparse estimates of a. We select ryi = . . . = r/^ = 1 for Fast-2L. 

We compare the performance of Fast-2L and Fast-3L to the performance of the following state-of-the- 
art sparse estimators: 

1) EM-RVM: the RVM algorithm in JTl, Ulthat uses the EM algorithm for parameter estimation 
assuming a constant hyperparameter priori^ 

2) EM-Laplace: the EM algorithm based on the hierarchical model for Laplace prior in |[T4l . Note 
that EM-Laplace is equivalent to the FOCUS S algorithm (HI with p = 1. 

3) Fast-RVM: the algorithm in (IF 

12 



, 11291 . which is a fast sequential version of EM-RVM assuming a 

constant hyperparameter priori;' 

4) Fast-Laplace: the algorithm in ||T9l is an analogue to Fast-RVM derived based on the hierarchical 
Laplace prior model for real weights with an additional hyperprior for the rate parameter rj]^ 

5) SpaRSA and Re-SpaRSA: the sparse reconstruction by separable approximation (SpaRSA) algo- 



rithm for solving the LASSO cost function |i33 



14 



and a modified version of it (Re-SpaRSA) with 



the re-weighted ii minimization realized using the framework proposed in [201 . 
As a reference, we also consider the performance of the oracle estimator of a Ii34l that "knows" the 
positions of its nonzero entries. The oracle estimate reads 

{H^Ho)-^H^y , on supp(Q) 
«o(y) = { (34) 
, elsewhere, 



"The software is available on-line at |http://dsp.ucsd.edu/~dwipFl 

'"The software is available on-line at |http://people.ee.diike.edu/~lcarin/BCS.htmll 



'^The software is available on-line at 'http://ivpl.eecs.northwestem.edu/' 
'''The software is available on-line at http://w ww.lx.it.pt/~mtf/SpaRSA/| 
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where supp(Q:) = {/ | a/ 7^ 0} and Hq is an M x K dictionary matrix constructed from those columns 
of H that correspond to the nonzero entries in a. The MSE of the oracle estimator Qq for a fixed Ho 
and noise precision A is A^^traceKiJ^ i^o)^}- 

We expect the performance of a SBL iterative algorithm to be determined by two main factors: the 
used underlying prior model and the inference scheme applied to it. Fast-RVM, Fast-Laplace, Fast-2L, 
and Fast-3L are all obtained by using the same inference scheme, but applied to different hierarchical 
prior models. Hence, by comparing the performances of these algorithms we can draw conclusions on 
the sparsity-inducing properties of their respective prior models^ This is done in Section IIV-AI In 
Section IIV-BI we benchmark our proposed algorithms with the rest of the estimators in the above list. 

A. Impact of the Hierarchical Prior Model on the Performance 

To this end, we compare the performance of Fast-RVM, Fast-Laplace, Fast-2L, and Fast-3L. These 
estimators are all derived based on the same inference scheme yet applied to different hierarchical models. 
Fast-RVM and Fast-Laplace can be easily adapted to complex scenarios. Notice that Fast-Laplace does 
not make use of the Laplace prior in the complex case as earlier explained. The iterative nature of the 
algorithms requires a stopping criterion. We use the same stopping criterion for all algorithms based on 
the difference in £(7) between two consecutive iterations as done in [29|. 

We begin by evaluating the performance versus (i) the SNR per received signal component and (ii) 
the number of iterations used until the stopping criterion is met. The results, shown in Fig. |4l depict the 
sensitivity of the algorithms to measurement noise. In these investigations we set M = 100 and = 256; 
the number of nonzero components in oc is set to K = 20. 

Fast-2L(e = 0) achieves nearly the performance of the oracle estimator in terms of MSE for both real 
and complex models. Furthermore, Fast-RVM and Fast-Laplace perform as good as Fast-3L(e = 1). We 
also note that in high SNR regime (SNR > 60 dB) the performance of all compared estimators is almost 
indistinguishable. Yet in moderate SNR regime, the impact of the noise becomes significant and Fast-2L 
with small e clearly outperforms the other algorithms. The performance results for Fast-3L(e = 0) are 
not reported since they almost coincide with those for Fast-2L(e = 0). This indicates that it is mainly the 
selection of the shape parameter e in the second layer that dominates the performance of the algorithms, 
regardless of whether the 2-L or the 3-L prior model is used. 

'^Naturally, the practical implementation of the inference scheme also impacts the performance. Fast-Laplace, Fast-2L, and 
Fast-3L are all implemented based on the software for Fast-RVM. 
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Fig. 4. MSB versus SNR for (a) real signal model and (b) complex signal model. 



We now study the mean of K versus the SNR. The corresponding results are summarized in Figs. |5(a) 



and 5(d)[ Clearly, Fast-2L(e = 0) outperforms the other algorithms: K is unbiased for SNR as low as 
15 dB in the complex case. Fast-RVM and Fast-Laplace produce a significantly biased estimate of K in 
low to average SNR regime. Only for very high SNR values K is unbiased. 

In Figs. |5(b)| and 5(e) we plot the mean of K versus the algorithms' iteration index for SNR fixed 
at 30 dB. Observe that Fast-2L converges in roughly 30-40 iterations for both e = and e = 1/2. In 
contrast, the other algorithms require from 60 to more then 90 update iterations until convergence is 



achieved. As shown in Figs. |5(c)| and |5(f)| the number of iterations varies with the SNR. Notice that 
Fast-2L requires fewer iterations than any other considered algorithm in almost the whole SNR range 
evaluated. Furthermore, we observe that Fast-3L(e = 1) and Fast-RVM converge faster than Fast-Laplace. 
Nonetheless, Fast-Laplace leads to sparser representations. 

Next we evaluate the performance of the algorithms versus the number of available measurements 
M and the value of K for the SNR fixed at 30 dB. In these investigations we have not included Fast- 
3L(e = 1) as we can conclude from the previously presented results that it performs poorly compared 
to Fast-2L(e = 0) and Fast-2L(e = 1/2). First, we fix = 256 and K = 20 and vary M from 20 to 
120 samples. Then, for fixed M = 100 and N = 256 we vary K from 5 to M. The corresponding 
performance results are depicted in Fig. [6l In Figs. 6(a)[ |6(b)[ |6(e) and 6(f) we plot the MSE and the 
mean of K versus number of measurements M. Notice the characteristic threshold-like behavior of the 
MSE curves with respect to the number of observed measurement samples M. The threshold occurs for 
a smaller M with complex signal models as compared with real signal models. This is explained by the 
fact that in the complex case both real and imaginary parts are used to prune components in d, thus 
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Fig. 5. Performance comparison with (a)-(c) real signal models and (e)-(g) complex signal models: (a),(d) mean of K versus 
SNR; (b),(e) mean of K versus iterations; (c),(f) number of iterations used versus SNR. The grey curve in (a)-(c),(e)-(f) indicates 
K. The SNR is set to 30 dB in (b),(e). 



improving the MSE threshold. Furthermore, Fast-RVM and Fast-Laplace consistently produces a biased 
estimate of K for both real and complex signal models; in contrast, Fast-2L(e = 0) achieves an unbiased 
estimate of K for sufficiently large values of M. 

In Figs. 6(c) 6(d)[ 6(g)[ and [6(h)] we plot the MSE and the mean of K versus K. A similar threshold- 
like behavior can be observed as K increases, with the threshold occurring for a higher value of K in 
case of complex signal models, i.e., in complex scenarios less sparse signals can be estimated with higher 
accuracy. Analogously to previously discussed results, Fast-2L with small e again outperforms the other 
algorithms. In particular. Fig. |6(h)| shows that Fast-2L(e = 0) achieves an unbiased estimate of K for 
values as high as ivT = 45. 



B. Comparison with the Other Sparse Estimators 

In the following, we compare Fast-2L(e = 0) with EM-RVM, EM-Laplace, SpaRSA, and a re-weighted 
version of SpaRSA. The latter two estimators implement a proximal gradient method to optimize the 
LASSO objective function, with the re-weighted version introducing additional parameter weighting 
to improve performance. Specifically, SpaRSA optimizes the LASSO objective function with a single 
regularization parameter k that regulates the trade-off between the quadratic fit term \\y — Hol\W and 
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Fig. 6. Performance comparison witli (a)-(d) real signal models and (e)-(h) complex signal models: (a),(e) MSB versus M; 
(b),(f) mean of K versus M ; (c),(g) MSB versus K\ (d),(h) mean of K versus M. The SNR is set to 30 dB. The grey curve 
in (b),(d),(f),(h) indicates K. 



the £i-norm penalization of the weight vector ex. In [20] it is demonstrated that introducing an adaptive 
weighting for the components of a., and thus regularizing each such component separately, leads to 
improved sparse estimates. This is in many respects analogous to the estimation of the parameters -qi, 
I = I, . . . ,L, in our 3-L prior models. SpaRSA can then be applied to a modified (weighted) LASSO 
cost function as follows (see |[20l ): 

z = argmin | ^H?/ - HB^^z\\l + k\\z\\i I , (35) 

where a = B^z with B = diag(/3) and k is some fixed constant. We set k = 0.1 in all simulations. 
The components of /3 = . . . ,/3l]^ are updated according to /3; = {\ai\ + 10^'^)^^, where ai is the 
current estimate of a;, a total of 4 times [20]. Further in the text we refer to this algorithm as Re-SpaRSA. 
The motivation for using SpaRSA is its applicability for both real as well as complex models, as opposed 
to many other software packages for ii minimization. 

In our implementation of EM-RVM and EM-Laplace both algorithms remove a vector hi from H 
when the corresponding estimate of 7^ is below 10~^. SpaRSA and Re-SpaRSA efficiently prune a 
component due to the soft-thresholding effect (see (fTT])). This allow us to compare the number of estimated 
components obtained with Fast-2L(e = 0) and with EM-RVM, EM-Laplace, SpaRSA, and Re-SpaRSA. 
Since the convergence rate of EM-RVM and EM-Laplace might be very slow, we additionally restrict 
the maximum number of update iterations in all algorithms to 1000. Furthermore, we set r/ = 1 for 
EM-Laplace. 
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(d) (e) (f) 



Fig. 7. Performance comparison with com|)lex signal models: (a),(d) MSB and mean of K versus SNR; (b),(e) MSE and mean 
of K versus Ad; (c),(f) MSE and mean of K versus K. The SNR is set to 30 dB in (b),(c),(e), and (f). The grey curve in (d)-(f) 
indicates K. 



To illustrate and compare the performance of the algorithms we follow the same procedure as in the 
previous investigation. We only present the results obtained with the complex signal models as the same 
conclusions can be drawn when the algorithms are applied to real models. The corresponding results are 
summarized in Fig. U\ where we have omitted the results for SpaRSA as Re-SpaRSA exhibits a superior 
performance in all investigated scenarios. 



In Fig. |7(a)| we see that Fast-2L(e = 0) achieves the performance of the oracle estimator in terms 
of MSE almost over the whole SNR range; it is closely followed by Re-SpaRSA that, however, only 
achieves this performance for an SNR exceeding 30 dB. In contrast to the other estimators, EM -Laplace 
performs much worse, yielding an MSE significantly larger than that of the other estimators even for 
very high SNR values. The fixed value of rj explains this poor performance of EM-Laplace. Note that 
making r] adaptive, as effectively realized in Re-SpaRSA, leads to a much improved performance when 
estimating K. Nonetheless, Fast-2L(e = 0) is still clearly better: for SNR exceeding 15 dB its estimate 
of K is unbiased. 

We now study the performance of the algorithms as a function of the number of measurements M. 
The corresponding results are summarized in Figs. |7(b)| and 7(e) Note that EM-RVM has a lower MSE 
threshold - roughly 10 fewer samples are required to achieve the same MSE as compared to Fast- 
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2L(e = 0) and Re-SpaRSA. However, for M exceeding the threshold the MSE is higher for EM-RVM 
as compared to Fast-2L(e = 0) and Re-SpaRSA. Fast-2L(e = 0) and Re-SpaRSA perform best when 
estimating K. In short, the mean of K is almost constant provided M > 60 for these two algorithms. 
However, Re-SpaRSA still exhibits a bias as opposed to Fast-2L(e = 0). The mean of K returned by 
EM-Laplace and EM-RVM grows as M increases. 

Finally, we analyze the MSE and the mean of K versus K. The corresponding results are summarized in 
Figs. |7(c)"] and |7(f)] Here, Fast-2L(e = 0) yields the smallest MSE for K < 45. Furthermore, Fast-2L(e = 
0) produces much sparser estimates than the other estimators. Specifically, EM-RVM and EM-Laplace 
produces biased estimates of K in both investigations, while Re-SpaRSA yields unbiased estimates only 
for K < 20. Fast-2L(e = 0) shows a marked improvement over the rest of the estimators, as its estimate 
of K is unbiased for signals with \xp to K = 45 nonzero components. 

V. Conclusion 

In this paper we proposed a sparse Bayesian inference approach based on hierarchical prior modeling 
that applies to sparse signal representation from overcomplete dictionaries in complex as well as real 
signal models. As a starting point we model the Bessel K probability density function to cover both 
real and complex signals using a two-layer (2-L) hierarchical probabilistic structure. The approach uses 
a product of independent Gaussian scale mixtures, one for each entry of the weight vector a, with the 
variance of each mixture following a gamma distribution characterized by a common shape parameter e 
and an entry-specific scale parameter r]i. The choice e = 1 in case of real weights and e = 3/2 in case 
of complex weights leads to a Laplace prior for each entry in cc. In addition, the selection rji = rj, VZ, 
with these two choices of e induces the £i penalty term for the maximum a posteriori (MAP) estimator. 
Naturally, other values of e > can be utilized. This additional degree of freedom leads to novel priors for 
OL with strong sparsity-inducing properties. Specifically, it was shown that the selection e = encourages 
a sparser solution than the Laplace prior does. Furthermore, in case of an orthonormal dictionary matrix 
H the MAP estimator is a generalization of the well-known soft- thresholding rule. 

We also considered a further extension of the 2-L prior model by modeling the hyperparameters i]i as 
independent gamma distributed random variables. The new three layer (3-L) model also generalizes to 
complex as well as real weights with different sparsity-inducing properties. In contrast to the case of the 
2-L model, when H is orthonormal, the MAP estimator is a generalization of the hard-thresholding rule. 

Finally, we applied the fast Bayesian inference scheme, originally proposed by M. Tipping, to estimate 
the parameters of the 2-L and 3-L hierarchical prior models. The numerical results demonstrate how 
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these hierarchical structures improve the performance, both in terms of convergence speed, sparseness, 
and mean-squared error of the estimators as compared to nowadays' sparse algorithms obtained based 
on traditional prior models. 

The proposed Bayesian algorithms are very efficient, yet analytically tractable and of low computational 
complexity for sparse signal representation in real and complex signal models. Our numerical results show 
a very significant performance improvement over state-of-the-art sparse estimators in low and moderate 
signal-to-noise ratio regimes, in which these estimators fail to produce sparse solutions. 



Appendix 

Technical Lemmas and Proofs for SectionHIT] 
We start out by computing the derivative of £{'yi) (see (l27l)): 



nil) 



il-e + p)sf + 2fiisi + 2{e - l)si - SIP + plqil"^ - fii +7, ^(e-1) 



(36) 



For the proofs of Propositions [U and |2] we define the function f{'yi) = 7^(1 + 7/sO^^'(70 corresponding 
to the right-hand polynomial in (1281 ): 

fill) = -ifmsf - if [(1 - e + p)sf + 2fiisi'\ + 7/ [2(e - l)si - sip + p\qi\^ - r};] + (e - 1). (37) 

Notice that: (i) ^'(7/) and /(7i) share the same roots on IR+ con^esponding to the stationary points of 
£(7/); (ii) For any positive 7/, ^(7;) and /(7«) have the same sign. Hence, we can study the stationary 
points of from the analysis of /(7i) on M+. Finally, the derivative of f{'yi), 



fill) = -^liViSi -2ji {I - e + p)si +2fjisi + 2{e-l)si- sip + p\qi 

is a concave parabola that crosses the ^i-axis at the two points 

{-) _ -sfjl -e + p) - 2sifji - siy/di 



m 



(38) 



li 



li 



{+) 



-sf{l - e + p) - 2sifii + si^/di 



(39) 
(40) 



Here, di = sf{l — e + p)^ — 2s/f/i + sifji{2e + p) + fii{fii + 3\qi\'^p). Note that ^y^'^ is either complex (and 
so is 7;^^b or negative for e < 1 + p. 

We begin by analyzing the case in which e < 1, as assumed in Proposition [T] We will need the 
following lemma: 
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Lemma 1: Provided e < 1, if the function /(7i) has positive roots then it has exactly two of them. 
Proof: Assume that e < 1. As /(O) = e — 1 < and lim-y,_s>_oo fill) = oo, f{ii) has an odd 
number of negative roots (either one or three). Hence, if f{ii) has positive roots it must have exactly 
two of them. ■ 

Using Lemma [T] we can now prove Proposition [1] 

Proposition 1: Provided e < 1 the function £(7;) has at most a single local maximum. 

Proof: Assume that e < 1. From Lemma[T] f{ii) has either exactly two positive roots ^y^^^ and 
or it has no such root at all. In the latter case £{11) has no local maximum!^ In the former case, we 
distinguish between two different possibilities: (i) 7^^ = 7^'' = 7^"'"^ and (ii) < 7^^ < 7^+^ < 7^^ If 
(i) holds true, 7^^^^ = 7^^ = 7^"'') corresponds to a saddle point of £{11). Indeed, since /(O) = (e — 1) < 
and lim^,_j.oo fill) = —00, 7^^ = 7^^ = 7^+) corresponds to the maximum of fdi) and, hence, fdi) 
is negative on both intervals (0,7)^^^) and (7|^^\oo). This implies that £{11) monotonically decreases on 

these two intervals, hence, 7^^^^ corresponds to a saddle point of i{ii). If (ii) holds true, the rightmost root 

(2) 

7^ must necessarily correspond to a local maximum. Since /'(7;) is a concave parabola with di > and 
/(O) < 0, the function £{ii) decreases on the interval (0,7^^) because fdi) is negative on this interval, 
then increases on (7}^'',7p'') as fdi) is positive on this interval, and finally decreases towards —00 on 
the interval (7p'',oo). Thus, 7]'^^ corresponds to the local minimum of £{11) and 7^^ corresponds to the 
only local maximum of ^(7;). ■ 
We now turn to the proof of Proposition |2] We begin by proving the following lemma: 
Lemma 2: Provided 1 < e < 1 + p the function f{n) has a positive root of multiplicity one or three. 
Proof: Assume 1 < e < 1 + p. The function f{ii) has at least a positive root as /(O) = (e — 1) > 
and lim^,_^oo fill) = —00. Moreover, f'{ii) can at most change sign once over as 7|'~'* is either 
negative or complex for e < 1 + p. Thus, f{ii) has a positive root of multiplicity one or three. ■ 
Now, we are ready to state Proposition |2l 

Proposition 2: Provided 1 < e < 1 + p the function ^(7;) has a single stationary point and this 
stationary point corresponds to a maximum. 

Proof: From Lemma |2l f{ii) has a positive root 7^^*-* of multiplicity either one or three. Hence, 
l{li) has a single stationary point. Moreover, as /(O) = (e — 1) > 0, f{ii) is positive on the interval 
(0)7;^*^) and negative on the interval {i\*\oo), which imphes that this stationary point corresponds to 
the maximum of ^(7/). ■ 

"Remember that the definition domain of is R"*". 
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